Once upon a time we looked at classifying PE and Mach-O files. This time it's flipped on its head. Is it possible to use various clustering algorithms to group similar files together? But, why stop there!? Can we crank up the awesome and use information from those clusters to generate Yara signatures to find files that are similar in nature?
In this notebook we'll explore not only gathering static information from PE files, but also clustering on those attributes, and finally we'll show off the Yara signature generation capabilities.
In [28]:
# All the imports and some basic level setting with various versions
import IPython
import re
import os
import json
import time
import string
import pandas
import pickle
import struct
import socket
import collections
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pefile
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
print "IPython version: %s" %IPython.__version__
print "pandas version: %s" %pd.__version__
print "numpy version: %s" %np.__version__
%matplotlib inline
In [3]:
def get_lang_value(lang):
    for key, value in pefile.LANG.iteritems():
        if value == lang:
            return key
    return 0
In [4]:
# Grab from the json data what we want
def extract_features(filename, data):
    feature = {}
    feature['filename'] = filename[26:-8]
    feature.update(data['verbose']['pefile']['file header'])
    feature.update(data['verbose']['pefile']['optional header'])
    feature['image base'] = float(feature['image base'])
    feature['size of stack reserve'] = float(feature['size of stack reserve'])
    feature['size of stack commit'] = float(feature['size of stack commit'])
    feature['size of heap reserve'] = float(feature['size of heap reserve'])
    feature['size of heap commit'] = float(feature['size of heap commit'])
    if 'size of image base var' in feature:
        del feature['size of image base var']
    if 'data directories' in data['verbose']['pefile']:
        for k,v in data['verbose']['pefile']['data directories'].iteritems():
            feature['data dir ' + k + ' rva'] = v['rva']
            feature['data dir ' + k + ' size'] = v['size']
    '''
    if 'sections' in data['verbose']['pefile']:
        for idx, sec in enumerate(data['verbose']['pefile']['sections']):
            feature['section ' + str(idx) + ' virtual address'] = sec['virtual address']
            feature['section ' + str(idx) + ' virtual size'] = sec['virtual size']
            if idx == 2:
                break
    '''
    if 'resources' in data['verbose']['pefile']:
        feature['number of resources'] = len(data['verbose']['pefile']['resources'])
        for index, resource in enumerate(data['verbose']['pefile']['resources']):
            feature['resource ' + str(index) + ' lang'] = get_lang_value(resource['lang'])
            feature['resource ' + str(index) + ' size'] = resource['size']
            feature['resource ' + str(index) + ' rva'] = resource['rva']
            if index == 2:
                break
    return feature
In [5]:
def extract_vtdata(filename, data):
    vt = {}
    vt['filename'] = filename[26:-7]
    if 'scans' in data:
        if data['positives'] > 0:
            vt['label'] = 'malicious'
        else:
            vt['label'] = 'nonmalicious'
        vt['positives'] = data['positives']
        if 'Symantec' in data['scans']:
            vt['symantec'] = data['scans']['Symantec']['result']
        if 'Sophos' in data['scans']:
            vt['sophos'] = data['scans']['Sophos']['result']
        if 'F-Prot' in data['scans']:
            vt['f-prot'] = data['scans']['F-Prot']['result']
        if 'Kaspersky' in data['scans']:
            vt['kaspersky'] = data['scans']['Kaspersky']['result']
        if 'McAfee' in data['scans']:
            vt['mcafee'] = data['scans']['McAfee']['result']
        if 'Malwarebytes' in data['scans']:
            vt['malwarebytes'] = data['scans']['Malwarebytes']['result']
    else:
        vt['label'] = 'nonmalicious'
        vt['positives'] = 0
    return vt
In [6]:
def load_files(file_list):
    import json
    features_list = []
    for filename in file_list:
        with open(filename,'rb') as f:
            features = extract_features(filename, json.loads(f.read()))
        features_list.append(features)
    return features_list
import glob
file_list = glob.glob('pefile_clustering_bsidelv/*.results')
features = load_files(file_list)
print "Files:", len(file_list)
In [7]:
def load_vt_data(file_list):
    import json
    features_list = []
    for filename in file_list:
        with open(filename,'rb') as f:
            features = extract_vtdata(filename, json.loads(f.read()))
        features_list.append(features)
    return features_list
import glob
file_list = glob.glob('pefile_clustering_bsidelv/*.vtdata')
vt_data = load_vt_data(file_list)
In [8]:
df = pd.DataFrame.from_records(features)
for col in df.columns:
    if col.startswith('resource'):
        df[col].fillna(-1, inplace=True)
df.fillna(-1, inplace=True)
df.head(5)
Out[8]:
In [9]:
df_vt = pd.DataFrame.from_records(vt_data)
df_vt.fillna('No detection', inplace=True)
df_vt.head(5)
Out[9]:
In [10]:
cols = [x for x in df.columns.tolist() if x != 'filename']
In [11]:
X = df.as_matrix(cols)
from sklearn.preprocessing import scale
X = scale(X)
from sklearn.decomposition import PCA
DDD = PCA(n_components=3).fit_transform(X)
DD = PCA(n_components=2).fit_transform(X)
In [12]:
from mpl_toolkits.mplot3d import Axes3D
figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(DDD[:,0], DDD[:,1], DDD[:,2], s=50)
ax.set_title("Features in 3D")
ax = fig.add_subplot(1, 2, 2)
ax.scatter(DD[:,0], DD[:,1], s=50)
ax.set_title("Features in 2D")
plt.show()
First up is DBSCAN; it enjoys long walks on the beach, non-flat geometry, and uneven cluster sizes (http://scikit-learn.org/stable/modules/clustering.html). This seemed like a good selection for several reasons. We expect uneven cluster sizes, since this sample set contains both malware and nonmalicious binaries. Because the features are built from the file structure, clustering should pick out the different tool chains (compilers, etc...) used to produce the files, and it would be surprising to see that kind of information evenly distributed across the data set. Hopefully we will even be able to cluster malware families together. Another nice feature of the scikit-learn implementation is that all samples that don't belong to a cluster are labeled with "-1". This avoids shoving files into clusters and reducing the efficiency of any generated Yara signature. However, if we're searching for more generic sigs, we can play games to get more samples into clusters or use different algorithms.
We also show the difference between clustering on the raw (non-scaled, non-reduced) data and on scaled, PCA-reduced data, and how the latter usually gives different (and better) results.
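When we do scale and reduce below, those two preprocessing steps can also be bundled into a single scikit-learn pipeline. Here's a minimal sketch (it assumes the df and cols built above and picks an arbitrary n_components=10 just for illustration); the cells that follow do the same steps by hand so the intermediate matrices are easier to inspect.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Scale then reduce in one shot; n_components=10 is an arbitrary choice for this sketch
preprocess = Pipeline([('scale', StandardScaler()), ('pca', PCA(n_components=10))])
X_reduced = preprocess.fit_transform(df.as_matrix(cols))

labels = DBSCAN(min_samples=3).fit_predict(X_reduced)
print "Noise points (label -1): %d of %d" % ((labels == -1).sum(), len(labels))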
In [17]:
from sklearn.cluster import DBSCAN
X = df.as_matrix(cols)
dbscan = DBSCAN(min_samples=3)
dbscan.fit(X)
labels1 = dbscan.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
dbscan_df = df[['filename','cluster']]
print "Number of clusters: %d" % nclusters
print "Labeled samples: %s" % dbscan_df[dbscan_df['cluster'] != -1].filename.value_counts().sum()
print "Unlabeled samples: %s" % dbscan_df[dbscan_df['cluster'] == -1].filename.value_counts().sum()
We can see that without scaling and PCA just about everything is unlabeled. Let's try again using PCA. First we determine how many dimensions to reduce to, then we cluster.
In [18]:
X = df.as_matrix(cols)
X = scale(X)
pca = PCA().fit(X)
n_comp = len([x for x in pca.explained_variance_ if x > 1e0])
print "Number of components w/explained variance > 1: %s" % n_comp
In [19]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=n_comp).fit_transform(X)
dbscan = DBSCAN(min_samples=3)
dbscan.fit(X)
labels1 = dbscan.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
dbscan_df = df[['filename','cluster']]
print "Number of clusters: %d" % nclusters
print "Labeled samples: %s" % dbscan_df[dbscan_df['cluster'] != -1].filename.value_counts().sum()
print "Unlabeled samples: %s" % dbscan_df[dbscan_df['cluster'] == -1].filename.value_counts().sum()
Half the files ended up unclustered, so that's a little disappointing, but still a huge improvement.
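If getting more files into clusters matters more than keeping the signatures tight, one knob to turn is DBSCAN's eps (it defaults to 0.5). A rough sketch that reuses the scaled, PCA-reduced X from the cell above:
# Sweep eps and watch the trade-off between cluster count and unlabeled samples
for eps in [0.3, 0.5, 1.0, 2.0, 5.0]:
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X)
    n_clust = len(set(labels)) - (1 if -1 in labels else 0)
    print "eps=%.1f  clusters=%d  unlabeled=%d" % (eps, n_clust, (labels == -1).sum())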
In [20]:
dbscan_df.cluster.value_counts().head(10)
Out[20]:
Let's see these clusters in 3D and 2D now.
In [21]:
# Remove unlabeled samples for graphing to make it prettier
tempdf = df[df['cluster'] != -1].reset_index(drop=True)
X = tempdf.as_matrix(cols)
X = scale(X)
DDD = PCA(n_components=3).fit_transform(X)
DD = PCA(n_components=2).fit_transform(X)
figsize(12,12)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(2, 2, 1, projection='3d')
ax.scatter(DDD[:,0], DDD[:,1], DDD[:,2], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters")
ax = fig.add_subplot(2, 2, 2, projection='3d')
ax.set_xlim(-5,5)
ax.set_ylim(-5,15)
ax.set_zlim(-5,5)
ax.scatter(DDD[:,0], DDD[:,1], DDD[:,2], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters (zoomed in)")
ax = fig.add_subplot(2, 2, 3)
ax.scatter(DD[:,0], DD[:,1], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters")
ax = fig.add_subplot(2, 2, 4)
ax.set_xlim(-3,4)
ax.set_ylim(-5,7)
ax.scatter(DD[:,0], DD[:,1], c=tempdf['cluster'], s=50)
ax.set_title("DBSCAN Clusters (zoomed in)")
plt.show()
Let's see how well DBSCAN did. To this end, we use data from VirusTotal to help us.
In [22]:
dbscan_vt_df = pd.merge(dbscan_df, df_vt, on='filename', how='outer')
dbscan_vt_df.head()
Out[22]:
In [23]:
clusters = set()
print "Total Number of Clusters: %s\n" % (len(dbscan_vt_df['cluster'].unique().tolist()))
for name, blah in dbscan_vt_df.groupby(['cluster', 'label'])['label']:
    if name[0] in clusters:
        print "%s Cluster has both Malicious and Non-Malicious Samples" % name[0]
    clusters.add(name[0])
In [24]:
dbscan_cluster_results = dbscan_vt_df.groupby(['cluster', 'label']).count()
dbscan_cluster_results[['filename']].head(10)
Out[24]:
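The same comparison can be collapsed into a couple of summary numbers. This is just a sketch, reusing the dbscan_vt_df frame from above; homogeneity is 1.0 when every cluster contains only one label.
# Cluster-vs-label purity for the clustered (non-noise) samples
from sklearn.metrics import homogeneity_score
labeled = dbscan_vt_df[dbscan_vt_df['cluster'] != -1]
print pd.crosstab(labeled['cluster'], labeled['label']).head(10)
print "Homogeneity: %.3f" % homogeneity_score(labeled['label'], labeled['cluster'])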
In [25]:
dbscan_vt_df[dbscan_vt_df['filename'] == 'dc2ecab3759956a2c87da411c1ecce32fe2b71d8ade00d0dadbd460de91b411c']
Out[25]:
In [26]:
cluster_dc2 = dbscan_vt_df[dbscan_vt_df['cluster'] == 29]
cluster_dc2[['f-prot', 'mcafee', 'symantec', 'sophos', 'kaspersky', 'malwarebytes']]
Out[26]:
Below you'll see a simple call-out to a yara_signature Python module. This module contains code to generate a signature based on attributes found in the file. We've chosen a cluster and a file from that cluster to base the signature on. Then the attributes that are present (not -1) across the cluster are added to the signature. Some of the packed struct values can be wildcarded in the sig, and that's the reason for the multiple lists keeping track of the file-header and optional-header attributes.
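To make the wildcarding step concrete before the full cell below: each value is packed as a little-endian DWORD, and any hex digit that doesn't agree across every sample in the cluster is replaced with '?'. A stripped-down sketch of just that idea (the mask_values helper is hypothetical, for illustration only, and not part of the yara_signature module):
import struct

# Hypothetical helper, for illustration only: wildcard hex digits that differ across a cluster
def mask_values(values):
    hexed = [struct.pack("<I", int(v)).encode("hex") for v in values]
    masked = []
    for digits in zip(*hexed):
        masked.append(digits[0] if len(set(digits)) == 1 else '?')
    return ''.join(masked)

print mask_values([0x1000, 0x1000])   # identical values -> no wildcards
print mask_values([0x1000, 0x2000])   # values differ    -> '?' where the hex digits disagree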
In [29]:
import yara_signature
import struct
name = 26
fdf = pd.DataFrame()
for f in dbscan_df[dbscan_df['cluster'] == name].filename.tolist():
    fdf = fdf.append(df[df['filename'] == f], ignore_index=True)
# Choose a signature from cluster to use as the basis of the sig w/the attributes below
filename = 'dc2ecab3759956a2c87da411c1ecce32fe2b71d8ade00d0dadbd460de91b411c'
meta = {"author" : "dorsey", "email" : "dorsey_at_clicksecurity_dot_com"}
sig = yara_signature.yara_pe_generator.YaraPEGenerator('./'+filename, samplename="Cluster_"+str(name), meta=meta)
file_header_columns = ["pointer to symbol table", "characteristics", "number of symbols", "size of optional header",
                       "machine", "compile date", "number of sections"]
optional_header_columns = ["subsystem", "major image version", "image base", "size of heap reserve",
                           "major operating system version", "section alignment", "loader flags",
                           "minor subsystem version", "major linker version", "size of stack commit",
                           "size of code", "size of image", "number of rva and sizes", "dll charactersitics",
                           "file alignment", "size of stack reserve", "minor linker version", "base of code",
                           "size uninit data", "entry point address", "size init data", "major subsystem version",
                           "magic", "checksum", "size of heap commit", "minor image version",
                           "minor operating system version", "size of headers", "base of data", "size of image base var",
                           "data dir base relocation rva", "data dir base relocation size", "data dir debug rva",
                           "data dir debug size", "data dir exception table rva", "data dir exception table size",
                           "data dir export table rva", "data dir export table size",
                           "data dir import address table rva", "data dir import address table size",
                           "data dir import table rva", "data dir import table size",
                           "data dir resource table rva", "data dir resource table size", "data dir tls table rva",
                           "data dir tls table size"]
file_header = []
optional_header = {}
for col in fdf.columns:
    if len(fdf[col].unique()) == 1:
        if fdf[col].unique()[0] != -1:
            lower = [s for s in col if s.islower()]
            if fdf[col].unique()[0] != -1 or (len(lower) == len(col)):
                if col in file_header_columns:
                    file_header.append(col)
                if col in optional_header_columns:
                    optional_header[col] = struct.pack("<I", int(fdf[col].unique()[0])).encode('hex')
    if len(fdf[col].unique()) > 1:
        if col not in optional_header_columns:
            continue
        if type(fdf[col].unique()[0]) == str or len(fdf[col].unique()) > 9:
            continue
        u = []
        z = []
        for value in fdf[col].unique():
            u.append(struct.pack("<I", value).encode("hex"))
        for d in zip(*u):
            match = True
            for idx in range(1,len(d)):
                if d[0] != d[idx]:
                    match = False
                    break
            if match:
                z.append(d[0])
            else:
                z.append('?')
        string = ''.join(z)
        if string != '????????':
            optional_header[col] = string
if len(file_header) > 0:
    sig.add_file_header(file_header)
if len(optional_header) > 0:
    sig.add_optional_header_with_values(optional_header)
print sig.get_signature()
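Once the rule is printed it can be fed straight to yara-python to scan other files. A quick sketch, assuming the yara module (yara-python) is installed and substituting your own sample directory for the hypothetical path below:
# Compile the generated rule and scan a directory of samples with it
import glob
import yara
rules = yara.compile(source=sig.get_signature())
for path in glob.glob('pefile_clustering_bsidelv/samples/*'):   # hypothetical sample directory
    if rules.match(path):
        print "match:", path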
Now that we've got one method of going from clusters to Yara signatures down, let's take a brief look at what happens to the cluster shapes/distributions with some other clustering algorithms.
Next up, KMeans. It will put every sample into a cluster, and for this algorithm the number of clusters needs to be specified up front. There are a bunch of ways to determine how many clusters to use; below we went with a simple one from Wikipedia (http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set).
In [32]:
from sklearn.cluster import KMeans
X = df.as_matrix(cols)
X = scale(X)
#rule of thumb of k = sqrt(#samples/2), thanks wikipedia :)
import math
k_clusters = int(math.sqrt(len(X)/2))
kmeans = KMeans(n_clusters=k_clusters)
kmeans.fit(X)
labels1 = kmeans.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
kmeans_df = df[['filename', 'cluster']]
print "Number of clusters: %d" % nclusters
In [33]:
df.cluster.value_counts().head(10)
Out[33]:
In [34]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)
figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("K-Means Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-4,-1)
ax.set_ylim(20,35)
ax.set_zlim(-3,-1)
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("K-Means Clusters (zoomed in)")
plt.show()
In [35]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=n_comp).fit_transform(X)
# same sqrt(#samples/2) rule of thumb as above, hardcoded here
k_clusters = 22
kmeans = KMeans(n_clusters=k_clusters)
kmeans.fit(X)
labels1 = kmeans.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
kmeans_df = df[['filename', 'cluster']]
print "Number of clusters: %d" % nclusters
print
print "Cluster/Sample Layout"
print df.cluster.value_counts().head(10)
print
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)
figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("KMeans Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-3,-1)
ax.set_ylim(20,35)
ax.set_zlim(-3,-1)
ax.scatter(X[:,0], X[:,1], X[:,2], c=kmeans_df['cluster'], s=50)
ax.set_title("KMeans Clusters (zoomed in)")
plt.show()
Above you can see how scaling and PCA lead to a bit more balanced layout of some of the clusters, but we've still got some outliers. Not a huge deal, just another way to slice and look at the data.
Let's see how KMeans did at clustering the files.
In [37]:
kmeans_vt_df = pd.merge(kmeans_df, df_vt, on='filename', how='outer')
kmeans_cluster_results = kmeans_vt_df.groupby(['cluster', 'label']).count()
kmeans_cluster_results[['filename']].head(10)
Out[37]:
In [38]:
clusters = set()
print "Total Number of Clusters: %s\n" % (len(kmeans_vt_df['cluster'].unique().tolist()))
for name, blah in kmeans_vt_df.groupby(['cluster', 'label'])['label']:
    if name[0] in clusters:
        print "%s Cluster has both Malicious and Non-Malicious Samples" % name[0]
    clusters.add(name[0])
Below we're looking at MeanShift. Scikit-learn is nice enough to tell us a bit about MeanShift use cases (many clusters, uneven cluster sizes, non-flat geometry). This seems to, once again, fit our data pretty well. Maybe we can get some better/different cluster layouts here.
In [39]:
from sklearn.cluster import MeanShift, estimate_bandwidth
X = df.as_matrix(cols)
X = scale(X)
ebw = estimate_bandwidth(X)
ms1 = MeanShift(bandwidth=ebw)
ms1.fit(X)
labels1 = ms1.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
meanshift_cluster_df = df[['filename', 'cluster']]
print "Estimated Bandwidth: %s" % ebw
print "Number of clusters: %d" % nclusters
In [40]:
tempdf = df[df['cluster'] != 0].reset_index(drop=True)
X = tempdf.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)
figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-10,-2)
ax.set_ylim(10,20)
ax.set_zlim(-5,-1)
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters (zoomed in)")
plt.show()
In [41]:
df.cluster.value_counts().head(10)
Out[41]:
In [42]:
X = df.as_matrix(cols)
X = scale(X)
X = PCA(n_components=n_comp).fit_transform(X)
ebw = estimate_bandwidth(X)
ms1 = MeanShift(bandwidth=ebw)
ms1.fit(X)
labels1 = ms1.labels_
df['cluster'] = labels1
labels1_u = np.unique(labels1)
nclusters = len(labels1_u)
cluster_df = df[['filename', 'cluster']]
print "Estimated Bandwidth: %s" % ebw
print "Number of clusters: %d" % nclusters
print
print "Cluster/Sample Layout"
print df.cluster.value_counts().head(10)
print
# Once again we can remove, in this case, the largest cluster for a less dense graph
tempdf = df[df['cluster'] != 0].reset_index(drop=True)
X = tempdf.as_matrix(cols)
X = scale(X)
X = PCA(n_components=3).fit_transform(X)
figsize(12,8)
fig = plt.figure(figsize=plt.figaspect(.5))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters")
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.set_xlim(-10,-2)
ax.set_ylim(10,20)
ax.set_zlim(-5,-1)
ax.scatter(X[:,0], X[:,1], X[:,2], c=tempdf['cluster'], s=50)
ax.set_title("MeanShift Clusters (zoomed in)")
plt.show()
In [43]:
ms_vt_df = pd.merge(cluster_df, df_vt, on='filename', how='outer')
ms_cluster_results = ms_vt_df.groupby(['cluster', 'label']).count()
ms_cluster_results[['filename']].head(10)
Out[43]:
It seems we've run into a similar situation with MeanShift as with DBSCAN. Instead of samples being unlabeled, we wound up with one cluster holding the vast majority of them. Unfortunately, using PCA doesn't help very much, and most of the samples remain in that one large cluster.
Overall, it's important to see how different algorithms can impact the end result, especially when trying to transfer knowledge from one domain to another. The various clustering techniques lead to different Yara signatures, which will fire on different sets of files. When dealing with large amounts of malware, this is one way to group existing samples and detect new potential variants of the same family.
Good luck and happy hunting!